#Importing Libraries
#Data manipulation
import pandas as pd
import numpy as np
#Data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
The goal of this project is to perform exploratory data analysis (EDA) on the Netflix dataset. Given the wide scope of possibilities in this dataset, it helps to frame the analysis around business questions, such as the ones posed in each section below.
#Loading Dataset
df = pd.read_csv('Leadzai_DS_r&s_Exercise01_netflix.csv')
#Initial Data Inspection
df.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
df.info()
df.describe(include = 'all')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8807 | 8807 | 8807 | 6173 | 7982 | 7976 | 8797 | 8807.000000 | 8803 | 8804 | 8807 | 8807 |
| unique | 8807 | 2 | 8807 | 4528 | 7692 | 748 | 1767 | NaN | 17 | 220 | 514 | 8775 |
| top | s3461 | Movie | Badha | Rajiv Chilaka | David Attenborough | United States | January 1, 2020 | NaN | TV-MA | 1 Season | Dramas, International Movies | Paranormal activity at a lush, abandoned prope... |
| freq | 1 | 6131 | 1 | 19 | 19 | 2818 | 109 | NaN | 3207 | 1793 | 362 | 4 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2014.180198 | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.819312 | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1925.000000 | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013.000000 | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2017.000000 | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019.000000 | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021.000000 | NaN | NaN | NaN | NaN |
df.isnull().sum()
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
df.duplicated().sum()
0
df.nunique()
show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64
Before proceeding with the Exploratory Data Analysis (EDA), a few findings stand out: show_id is a unique identifier that adds no analytical value and can be dropped; date_added is stored as a string and should be converted to datetime; and several columns (director, cast, country, date_added, rating, duration) contain missing values.
#drop show_id
df.drop('show_id', axis=1,inplace=True)
#convert date_added to datetime
df.date_added = pd.to_datetime(df.date_added)
pd.to_datetime(df.date_added).min()
Timestamp('2008-01-01 00:00:00')
pd.to_datetime(df.date_added).max()
Timestamp('2021-09-25 00:00:00')
Important note: since the dataset ends on 25 September 2021, the year 2021 is cut short and may show smaller counts than it would in reality.
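This truncation can be sanity-checked by counting additions per calendar month; a minimal sketch using a toy Series standing in for df['date_added'] (the real column isn't reproduced here):

```python
import pandas as pd

# Toy stand-in for df['date_added'] (the real column ends on 2021-09-25)
dates = pd.to_datetime(pd.Series([
    "January 1, 2021", "March 15, 2021",
    "September 1, 2021", "September 25, 2021",
]))

# Count additions per calendar month; in the real data, months after
# September 2021 are simply absent, which is how the truncation shows up
monthly = dates.dt.to_period("M").value_counts().sort_index()
print(monthly)
```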
What type of production is more prevalent in Netflix (tv show vs movie)?
#Getting proportions for movie & tv shows
ratio = pd.DataFrame([df[df.type == 'Movie']['type'].count()/df['type'].count(), df[df.type == 'TV Show']['type'].count()/df['type'].count()])
ratio.index = ['Movies','TV Shows']
string_perc = ['{:.1%}'.format(value) for value in ratio[0].values] #getting percentage format
# Create a stacked bar chart
fig = px.bar(ratio,text=string_perc, orientation='h', color=ratio.index, color_discrete_sequence= px.colors.diverging.Portland)
fig.update_layout(
title="The most common production type on Netflix is Movies",
xaxis_title="",
yaxis_title="",
xaxis_showticklabels=False,
showlegend=False)
fig.show()
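As a side note, the two-step proportion computation above can be collapsed into a single call with value_counts(normalize=True); a small sketch on a toy stand-in for df['type']:

```python
import pandas as pd

# Toy stand-in for df['type']; the real column only holds 'Movie' / 'TV Show'
types = pd.Series(["Movie", "Movie", "Movie", "TV Show"])

# value_counts(normalize=True) returns the proportions in one call,
# already sorted by frequency
ratio = types.value_counts(normalize=True)
labels = ['{:.1%}'.format(v) for v in ratio.values]
print(ratio.to_dict(), labels)  # {'Movie': 0.75, 'TV Show': 0.25} ['75.0%', '25.0%']
```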
How are productions distributed across the world given their country of origin?
#Many instances list more than one country and some have a "dirty" format such as ", France, Algeria "
#We will assume the first country listed is the official country of production and clean the format
df['official_country'] = df['country'].dropna().apply(lambda x: x.strip(', ').split(',')[0].strip())
#Create dataframe with production countries & counts
df_map = pd.DataFrame()
df_map['Countries'] = df.groupby(['official_country'])['official_country'].count().index
df_map['Count'] = df.groupby(['official_country'])['official_country'].count().values
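The first-country extraction can be sanity-checked on toy values, including the leading-comma case mentioned above (the sample strings below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df['country'], including the "dirty" leading-comma case
countries = pd.Series([", France, Algeria ", "United States",
                       "India, United Kingdom", None])

# Strip stray commas/spaces before taking the first listed country,
# so the dirty entries do not collapse to an empty string
official = countries.dropna().apply(lambda x: x.strip(', ').split(',')[0].strip())
print(official.tolist())  # ['France', 'United States', 'India']
```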
#The following ISO Alpha-3 country codes were obtained with the help of ChatGPT
#These codes will be used for creating a world map plot
country_codes = {'afghanistan': 'AFG', 'albania': 'ALB', 'algeria': 'DZA', 'american samoa': 'ASM',
'andorra': 'AND', 'angola': 'AGO', 'anguilla': 'AIA', 'antigua and barbuda': 'ATG',
'argentina': 'ARG', 'armenia': 'ARM', 'aruba': 'ABW', 'australia': 'AUS',
'austria': 'AUT', 'azerbaijan': 'AZE', 'bahamas': 'BHM', 'bahrain': 'BHR',
'bangladesh': 'BGD', 'barbados': 'BRB', 'belarus': 'BLR', 'belgium': 'BEL',
'belize': 'BLZ', 'benin': 'BEN', 'bermuda': 'BMU', 'bhutan': 'BTN',
'bolivia': 'BOL', 'bosnia and herzegovina': 'BIH', 'botswana': 'BWA', 'brazil': 'BRA',
'british virgin islands': 'VGB', 'brunei': 'BRN', 'bulgaria': 'BGR', 'burkina faso': 'BFA',
'burma': 'MMR', 'burundi': 'BDI', 'cabo verde': 'CPV', 'cambodia': 'KHM',
'cameroon': 'CMR', 'canada': 'CAN', 'cayman islands': 'CYM', 'central african republic': 'CAF',
'chad': 'TCD', 'chile': 'CHL', 'china': 'CHN', 'colombia': 'COL',
'comoros': 'COM', 'congo democratic': 'COD', 'congo republic': 'COG', 'cook islands': 'COK',
'costa rica': 'CRI', "cote d'ivoire": 'CIV', 'croatia': 'HRV', 'cuba': 'CUB',
'curacao': 'CUW', 'cyprus': 'CYP', 'czech republic': 'CZE', 'denmark': 'DNK',
'djibouti': 'DJI', 'dominica': 'DMA', 'dominican republic': 'DOM', 'ecuador': 'ECU',
'egypt': 'EGY', 'el salvador': 'SLV', 'equatorial guinea': 'GNQ', 'eritrea': 'ERI',
'estonia': 'EST', 'ethiopia': 'ETH', 'falkland islands': 'FLK', 'faroe islands': 'FRO',
'fiji': 'FJI', 'finland': 'FIN', 'france': 'FRA', 'french polynesia': 'PYF',
'gabon': 'GAB', 'gambia, the': 'GMB', 'georgia': 'GEO', 'germany': 'DEU',
'ghana': 'GHA', 'gibraltar': 'GIB', 'greece': 'GRC', 'greenland': 'GRL',
'grenada': 'GRD', 'guam': 'GUM', 'guatemala': 'GTM', 'guernsey': 'GGY',
'guinea-bissau': 'GNB', 'guinea': 'GIN', 'guyana': 'GUY', 'haiti': 'HTI',
'honduras': 'HND', 'hong kong': 'HKG', 'hungary': 'HUN', 'iceland': 'ISL',
'india': 'IND', 'indonesia': 'IDN', 'iran': 'IRN', 'iraq': 'IRQ',
'ireland': 'IRL', 'isle of man': 'IMN', 'israel': 'ISR', 'italy': 'ITA',
'jamaica': 'JAM', 'japan': 'JPN', 'jersey': 'JEY', 'jordan': 'JOR',
'kazakhstan': 'KAZ', 'kenya': 'KEN', 'kiribati': 'KIR', 'north korea': 'PRK',
'south korea': 'KOR', 'kosovo': 'KSV', 'kuwait': 'KWT', 'kyrgyzstan': 'KGZ',
'laos': 'LAO', 'latvia': 'LVA', 'lebanon': 'LBN', 'lesotho': 'LSO',
'liberia': 'LBR', 'libya': 'LBY', 'liechtenstein': 'LIE', 'lithuania': 'LTU',
'luxembourg': 'LUX', 'macau': 'MAC', 'macedonia': 'MKD', 'madagascar': 'MDG',
'malawi': 'MWI', 'malaysia': 'MYS', 'maldives': 'MDV', 'mali': 'MLI',
'malta': 'MLT', 'marshall islands': 'MHL', 'mauritania': 'MRT', 'mauritius': 'MUS',
'mexico': 'MEX', 'micronesia': 'FSM', 'moldova': 'MDA', 'monaco': 'MCO',
'mongolia': 'MNG', 'montenegro': 'MNE', 'morocco': 'MAR', 'mozambique': 'MOZ',
'namibia': 'NAM', 'nepal': 'NPL', 'netherlands': 'NLD', 'new caledonia': 'NCL',
'new zealand': 'NZL', 'nicaragua': 'NIC', 'nigeria': 'NGA', 'niger': 'NER',
'niue': 'NIU', 'northern mariana islands': 'MNP', 'norway': 'NOR', 'oman': 'OMN',
'pakistan': 'PAK', 'palau': 'PLW', 'panama': 'PAN', 'papua new guinea': 'PNG',
'paraguay': 'PRY', 'peru': 'PER', 'philippines': 'PHL', 'poland': 'POL',
'portugal': 'PRT', 'puerto rico': 'PRI', 'qatar': 'QAT', 'romania': 'ROU',
'russia': 'RUS', 'rwanda': 'RWA', 'saint kitts and nevis': 'KNA', 'saint lucia': 'LCA',
'saint martin': 'MAF', 'saint pierre and miquelon': 'SPM', 'saint vincent and the grenadines': 'VCT', 'samoa': 'WSM',
'san marino': 'SMR', 'sao tome and principe': 'STP', 'saudi arabia': 'SAU', 'senegal': 'SEN',
'serbia': 'SRB', 'seychelles': 'SYC', 'sierra leone': 'SLE', 'singapore': 'SGP',
'sint maarten': 'SXM', 'slovakia': 'SVK', 'slovenia': 'SVN', 'solomon islands': 'SLB',
'somalia': 'SOM', 'south africa': 'ZAF', 'south sudan': 'SSD', 'spain': 'ESP',
'sri lanka': 'LKA', 'sudan': 'SDN', 'suriname': 'SUR', 'swaziland': 'SWZ',
'sweden': 'SWE', 'switzerland': 'CHE', 'syria': 'SYR', 'taiwan': 'TWN',
'tajikistan': 'TJK', 'tanzania': 'TZA', 'thailand': 'THA', 'timor-leste': 'TLS',
'togo': 'TGO', 'tonga': 'TON', 'trinidad and tobago': 'TTO', 'tunisia': 'TUN',
'turkey': 'TUR', 'turkmenistan': 'TKM', 'tuvalu': 'TUV', 'uganda': 'UGA',
'ukraine': 'UKR', 'united arab emirates': 'ARE', 'united kingdom': 'GBR', 'united states': 'USA',
'uruguay': 'URY', 'uzbekistan': 'UZB', 'vanuatu': 'VUT', 'venezuela': 'VEN',
'vietnam': 'VNM', 'virgin islands': 'VGB', 'west bank': 'WBG', 'yemen': 'YEM',
'zambia': 'ZMB', 'zimbabwe': 'ZWE'}
#assigning iso code to production df
df_map['iso_a3'] = df_map['Countries'].str.lower().map(country_codes)
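Because .map returns NaN for any country name absent from the dictionary, it is worth listing which entries failed to match before plotting. A minimal sketch with a deliberately trimmed-down mapping (toy data, not the real df_map):

```python
import pandas as pd

# Toy stand-in for df_map, with one country deliberately absent from the codes
df_map_toy = pd.DataFrame({'Countries': ['France', 'Soviet Union'],
                           'Count': [2, 1]})
codes = {'france': 'FRA'}  # trimmed-down mapping for the sketch

# .map returns NaN for unmatched keys, so unmapped entries are easy to list
df_map_toy['iso_a3'] = df_map_toy['Countries'].str.lower().map(codes)
unmapped = df_map_toy.loc[df_map_toy['iso_a3'].isna(), 'Countries']
print(unmapped.tolist())  # ['Soviet Union']
```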
#Plotting productions of the 50 countries with most productions
fig = px.histogram(df_map.sort_values(by='Count', ascending = False).head(50), x='Countries', y = 'Count')
fig.update_xaxes(categoryorder='total descending')
fig.update_layout(
#width=1080, # Width in pixels
height=440, # Height in pixels
xaxis=dict(
title='Countries',
tickmode='linear'),
yaxis_title="Number of Productions",
title="Some countries hold the majority of productions"
)
fig.show()
By looking at the plot above we observe that a few countries account for most of the total productions, while many countries have only a few. The US has the most productions, followed by India and the UK. To get better insight into how productions are distributed geographically, we will plot the total productions per nation on a world map. Note that while only the 50 countries with the most productions are plotted above, there are 86 in total.
#BUILDING WORLD MAP
# To take this skewness into account we create a colorscale able to highlight changes in those countries with few productions
custom_colors = [
[0.0, 'white'],
[0.05, 'lightCyan'],
[0.1, 'paleTurquoise'],
[0.2, 'Turquoise'],
[0.8, 'navy'],
[1.0, 'midnightBlue'],
]
# Create a Choropleth map (rows without a matched ISO code are excluded)
plotted = df_map.dropna(subset=['iso_a3'])
fig = go.Figure(data=go.Choropleth(
locations = plotted['iso_a3'],
z = plotted['Count'],
text = plotted['Countries'],
colorscale = custom_colors,
autocolorscale=False,
reversescale=False,
marker_line_color='black',
marker_line_width=1,
colorbar_title = 'Number of Productions',
))
# Set the layout of the map
fig.update_layout(
title_text='Netflix productions per nation',
geo=dict(
showframe=False,
showcoastlines=False,
projection_type='equirectangular'
),
annotations = [dict(
x=0.5,
y=0,
xref='paper',
yref='paper',
text='Source: Netflix Dataset',
showarrow = False
)]
)
fig.write_image("netflix_productions.png",width=1920,height=1080)
fig.show()
#Note: the following graph is interactive and the cursor can hover over countries for further info
In terms of regions, North America is by far the one with the most productions. While most continents have productions from the majority of their countries, Africa has only a few producing countries (with Egypt, Nigeria and South Africa accounting for most). Given that some continents contain many small countries (Europe, and Asia to some extent), it is interesting to group countries by continent to better understand how production numbers are distributed across the globe.
#Getting respective Continents, obtained with ChatGPT's help
country_continent_dict = {
'ARG': 'South America', 'AUS': 'Oceania',
'AUT': 'Europe', 'BGD': 'Asia', 'BLR': 'Europe',
'BEL': 'Europe','BRA': 'South America',
'BGR': 'Europe','KHM': 'Asia',
'CMR': 'Africa','CAN': 'North America',
'CHL': 'South America','CHN': 'Asia',
'COL': 'South America','HRV': 'Europe',
'CYP': 'Europe','CZE': 'Europe',
'DNK': 'Europe','EGY': 'Africa',
'FIN': 'Europe','FRA': 'Europe',
'GEO': 'Asia','DEU': 'Europe',
'GHA': 'Africa','GRC': 'Europe',
'GTM': 'North America','HKG': 'Asia',
'HUN': 'Europe','ISL': 'Europe',
'IND': 'Asia','IDN': 'Asia',
'IRN': 'Asia','IRL': 'Europe',
'ISR': 'Asia','ITA': 'Europe',
'JAM': 'North America','JPN': 'Asia',
'JOR': 'Asia','KEN': 'Africa',
'KWT': 'Asia','LBN': 'Asia',
'LUX': 'Europe','MYS': 'Asia',
'MUS': 'Africa','MEX': 'North America',
'MOZ': 'Africa','NAM': 'Africa',
'NLD': 'Europe','NZL': 'Oceania',
'NGA': 'Africa','NOR': 'Europe',
'PAK': 'Asia','PRY': 'South America',
'PER': 'South America','PHL': 'Asia',
'POL': 'Europe','PRT': 'Europe',
'PRI': 'North America','ROU': 'Europe',
'RUS': 'Europe','SAU': 'Asia',
'SEN': 'Africa','SRB': 'Europe',
'SGP': 'Asia','SVN': 'Europe',
'SOM': 'Africa','ZAF': 'Africa',
'KOR': 'Asia','ESP': 'Europe',
'SWE': 'Europe','CHE': 'Europe',
'SYR': 'Asia','TWN': 'Asia',
'THA': 'Asia','TUR': 'Asia',
'UKR': 'Europe','ARE': 'Asia',
'GBR': 'Europe','USA': 'North America',
'URY': 'South America','VEN': 'South America',
'VNM': 'Asia','ZWE': 'Africa'
}
#Getting df with continent data
df_map['continent'] = df_map['iso_a3'].map(country_continent_dict)
df_map_continents_temp = df_map.groupby('continent')['Count'].sum().reset_index()
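The code-to-continent aggregation above can be illustrated on a toy stand-in for df_map with a trimmed-down mapping (illustrative values only):

```python
import pandas as pd

# Toy stand-in for df_map with a trimmed-down continent mapping
df_map_toy = pd.DataFrame({'iso_a3': ['USA', 'CAN', 'FRA'],
                           'Count': [5, 2, 3]})
continents = {'USA': 'North America', 'CAN': 'North America', 'FRA': 'Europe'}

# Map each ISO code to its continent, then sum production counts per continent
df_map_toy['continent'] = df_map_toy['iso_a3'].map(continents)
totals = df_map_toy.groupby('continent')['Count'].sum()
print(totals.to_dict())  # {'Europe': 3, 'North America': 7}
```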
#Plotting productions per continent
fig = px.histogram(df_map.dropna(), x='continent', y = 'Count', color='Countries')
fig.update_xaxes(categoryorder='total descending')
fig.update_layout(
width=720, # Width in pixels
height=720, # Height in pixels
xaxis=dict(
title='Continents',
tickmode='linear'),
yaxis_title="Number of Productions",
title="The majority of productions are located in the continents of the Northern Hemisphere",
title_font_size=15,
showlegend=False,
)
fig.show()
#Note: the following graph is interactive and the cursor can hover over columns for further info
When it comes to total productions, North America leads Asia by 60.2% and Europe by 148.73%.
An interesting finding about the distribution of productions is that North America's total is mostly made up of the US, whereas Asia and Europe draw productions from many different countries, some of them fairly small yet with a strong impact on the continent's total.
Despite the low density of productions in most European and Asian countries (with the exception of the UK and India), these two continents still account for a considerable share of total productions. Continents in the Southern Hemisphere have low totals, with a handful of nations responsible for most of their production.
How are productions distributed in terms of release year?
ax = sns.displot(df, x="release_year", kde=True, height=8, aspect=1920/1080, binwidth=1, hue='type', multiple="stack")
plt.xlabel('Release Year', size=14)
plt.ylabel('Productions', size=14)
plt.title('Number of Productions per year released', size=20)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
We observe most productions were released in the late 2010s, close to the final year of this dataset (2021). The number of productions released decreases drastically before 2010. This corroborates what we saw during the data inspection phase: at least 50% of the productions were released between 2017 and 2021. Another interesting finding is that TV Shows appear to have become more popular relative to Movies: Movies dominated releases historically, but approaching 2021 TV Shows account for nearly half of the total releases.
When did Netflix add the most productions?
#First we group date_added by year
df['year_added'] = df.date_added.dt.year
#Create dataframe with year added & counts
df_year = pd.DataFrame()
df_year['Year'] = df.groupby(['year_added'])['year_added'].count().index
df_year['Count'] = df.groupby(['year_added'])['year_added'].count().values
#Plotting graph
fig = px.bar(df_year.sort_values(by='Year', ascending = False), x='Year', y = 'Count')
fig.update_layout(
height=440, # Height in pixels
xaxis=dict(
title='Year',
tickmode='linear'),
yaxis_title="Number of Productions Added",
title="Most Productions were added between 2016 and 2021 (2021 ends in September)"
)
fig.update_traces(marker_color='Navy',
marker_line_width=1.5, opacity=0.8)
fig.show()
What is the most Common duration for movies & for tv shows?
For Movies
#Creating a temporary df with Movies only
movie_length = df.loc[df.type == 'Movie'].copy()
#Getting time length as a number
movie_length['duration_minutes'] = movie_length.duration.str[:-4].astype(float)
#Plotting
fig, ax = plt.subplots(figsize=(10, 5))
sns.kdeplot(movie_length.duration_minutes, fill=True, ax=ax)  #fill replaces the deprecated shade argument
plt.xlabel('Length in minutes', size=12)
plt.ylabel('Density', size=12)
plt.title('Most Movies are about 100 minutes long', size=16)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
We observe the majority of Movies are at least 1 hour (60 minutes) long, peaking at around 100 minutes, and very few movies make it past the 2.5 hour (150 minutes) mark.
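The slice-based parsing above (str[:-4]) assumes every duration ends in exactly " min"; a regex extraction is slightly more defensive and gives the same result. A sketch on toy duration strings:

```python
import pandas as pd

# Toy stand-in for the Movie rows of df['duration']
durations = pd.Series(['90 min', '104 min', '66 min'])

# str.extract pulls the leading number regardless of how the unit is spelled
minutes = durations.str.extract(r'(\d+)', expand=False).astype(float)
print(minutes.tolist())  # [90.0, 104.0, 66.0]
```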
For TV Shows
#Creating a temporary df with TV Shows only
show_length = df.loc[df.type == 'TV Show'].copy()
#getting show length as integer (number of season) for sorting purposes if needed
show_length['Seasons'] = show_length.duration.str.extract(r'(\d+)', expand=False).astype(int)
#Counting amount of tv shows per length in seasons
Duration_counts=show_length.groupby(by='Seasons')['duration'].count()
#Plotting
fig = px.bar(Duration_counts)
fig.update_layout(
yaxis_type="log",
height=440,
xaxis=dict(
title='Total Seasons',
tickmode='linear'),
yaxis_title="Number of Productions",
title="Duration of TV Shows in Seasons"
)
fig.update_traces(marker_color='teal',
marker_line_width=1.5, opacity=0.8)
fig.show()
Most TV Shows have only 1 season, and the number of shows decreases roughly exponentially as the season count grows (hence the logarithmic scale in the graph above). Nevertheless there are outliers, such as two TV Shows with 15 seasons and one with 17 seasons.
What are the most common production genres?
# We first get a list of all Genres values appearing in the dataset
# For counting purposes if a production has 2 or more genres, we will include both genres separately in the list
Starred_genres = []
for item in list(df.listed_in.values):
# Check if the string contains a comma
if "," in item:
# Split the string into multiple values using comma as the delimiter
values = item.split(", ")
# Extend the output list with the separated values
Starred_genres.extend(values)
else:
# If no comma is found, simply append the original string
Starred_genres.append(item)
#defining a temporary dataframe for Starred genres and their counts
data = dict(genres=pd.Series(Starred_genres).value_counts().index,
counts=pd.Series(Starred_genres).value_counts().values)
df_genres = pd.DataFrame(data)
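As an aside, the manual comma-splitting loop above can be expressed as a single pandas chain with str.split and explode; a sketch on a toy stand-in for df['listed_in']:

```python
import pandas as pd

# Toy stand-in for df['listed_in']
listed_in = pd.Series(['Dramas, Comedies', 'Documentaries', 'Dramas'])

# str.split + explode replaces the manual comma loop in a single chain
genre_counts = listed_in.str.split(', ').explode().value_counts()
print(genre_counts.sort_index().to_dict())  # {'Comedies': 1, 'Documentaries': 1, 'Dramas': 2}
```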
# Create the Sunburst chart
fig = px.sunburst(df_genres, path=['genres'], values='counts')
fig.update_layout(
title_text="Most common Genres",
title_x=0.5,
title_font_size=20
)
fig.show()
Interestingly, International Movies and International TV Shows are among the most popular genres; these refer to productions made outside the US whose predominant language is not English. Aside from these two genres, Dramas, Comedies, Documentaries and Action & Adventure are among the most popular.
Since the US is the biggest producer of films, it is an interesting exercise to see how the most popular genres differ in US productions (assuming the "International" genres should not exist for American films).
# We first get a list of all Genres values appearing in the dataset
# For counting purposes if a production has 2 or more genres, we will include both genres separately in the list
Starred_genres_US = []
for item in list(df[df.country == 'United States'].listed_in.values):
# Check if the string contains a comma
if "," in item:
# Split the string into multiple values using comma as the delimiter
values = item.split(", ")
# Extend the output list with the separated values
Starred_genres_US.extend(values)
else:
# If no comma is found, simply append the original string
Starred_genres_US.append(item)
#defining a temporary dataframe for Starred genres and their counts
data = dict(genres=pd.Series(Starred_genres_US).value_counts().index,
counts=pd.Series(Starred_genres_US).value_counts().values)
df_genres_US = pd.DataFrame(data)
# Create the Sunburst chart
fig = px.sunburst(df_genres_US, path=['genres'], values='counts')
fig.update_layout(
title_text="Most common genres excluding International Productions (USA only)",
title_x=0.5,
title_font_size=16
)
fig.show()
When isolating the United States, the genres Dramas, Comedies and Documentaries are still on the top of productions, but surprisingly Children & Family Movies are more common than Action & Adventure. TV Dramas seems to be particularly less popular when compared to global productions.
How are productions distributed across different ratings?
df.rating.unique()
array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan,
'TV-Y7-FV', 'UR'], dtype=object)
From the unique rating values we find 3 instances whose "duration" ended up in the rating column: '74 min', '84 min', '66 min'. All 3 belong to the director Louis C.K., and since this appears to be a data-entry error, we will drop those instances for the rating analysis.
#Dropping wrong ratings
rating_df = df[~df['rating'].isin(['74 min', '84 min', '66 min'])]
#Getting the breakdown of ratings
rating_counts = rating_df.groupby(by='rating')['rating'].count().rename('rating_counts').reset_index().sort_values(by='rating_counts', ascending=False)
#To visualize the breakdown of rating we will build a basic radar plot
#Setting the data
# Values for the x axis
ANGLES = np.linspace(0.05, 2 * np.pi - 0.05, len(rating_counts), endpoint=False)
# Cumulative Ratings
Ratings = rating_counts["rating_counts"].values
# Rating label
Label = rating_counts["rating"].values
#Setting Colors
GREY12 = "#1f1f1f"
# Set default font to Bell MT
plt.rcParams.update({"font.family": "Bell MT"})
# Set default font color to GREY12
plt.rcParams["text.color"] = GREY12
# The minus glyph is not available in Bell MT
# This disables it, and uses a hyphen
plt.rc("axes", unicode_minus=False)
# Colors
COLORS = ["paleTurquoise","midnightBlue","navy"]
# Colormap
cmap = mpl.colors.LinearSegmentedColormap.from_list("my color", COLORS, N=256)
# Normalizer
norm = mpl.colors.Normalize(vmin=rating_counts['rating_counts'].min(), vmax=rating_counts['rating_counts'].max())
# Normalized colors. Each number of tracks is mapped to a color in the
# color scale 'cmap'
COLORS = cmap(norm(rating_counts['rating_counts']))
# Initialize layout in polar coordinates
fig, ax = plt.subplots(figsize=(8, 11), subplot_kw={"projection": "polar"})
# Set background color to white, both axis and figure.
fig.patch.set_facecolor("white")
ax.set_facecolor("white")
ax.set_theta_offset(1.2 * np.pi / 2)
ax.set_ylim(-1500, 3500)
# Add bars to represent the cumulative track lengths
ax.bar(ANGLES, Ratings, color=COLORS, alpha=0.9, width=0.4, zorder=14)
# Set the labels
ax.set_xticks(ANGLES)
ax.set_xticklabels(Label, size=13)
#set title
ax.set_title('Number of productions per rating', size = 20);
Most productions are rated either TV-MA or TV-14, which implies most productions are not oriented toward children.
It might be interesting to further investigate the breakdown of ratings, including to what segment of the population these productions most cater to (eg. Adults, teens or children)
#Getting the age segment each rating caters to, with the help of ChatGPT
rating_dict = {
'PG-13': 'Teens',
'TV-MA': 'Adults',
'PG': 'Children',
'TV-14': 'Teens',
'TV-PG': 'Children',
'TV-Y': 'Children',
'TV-Y7': 'Children',
'R': 'Adults',
'TV-G': 'Children',
'G': 'Children',
'NC-17': 'Adults',
'NR': 'Adults',
'TV-Y7-FV': 'Children',
'UR': 'Adults'
}
#Applying age groups
rating_counts['age_group'] = rating_counts.rating.map(rating_dict)
#Plotting
fig = px.bar(rating_counts, x='rating_counts', y='rating', color="age_group", orientation='h',
title="Number of productions per rating with age groups")
fig.show()
#Note: the following graph is interactive and the cursor can hover over columns for further info
Although adult-restricted ratings are the most common, Netflix still has a considerable number of productions that can be viewed by the younger population segments.
Have new productions changed rating distribution across time?
#First we need to get productions grouped by year added and then get proportions of ratings for each year
#we are going to use the rating_df which already treated the wrong instances
rating_year_df = rating_df.groupby(['rating','year_added']).count()[['type']].reset_index()
#Get proportions
rating_year_df['percentage'] = rating_df.groupby(['rating','year_added']).size().groupby(level=1).apply(lambda x: 100 * x / float(x.sum())).values
#getting age_group
rating_year_df['age_group'] = rating_year_df.rating.map(rating_dict)
fig = px.bar(rating_year_df, x='year_added', y='percentage', color="rating",
title="Rating breakdown of added productions, per year")
fig.show()
#Note: the following graph is interactive and the cursor can hover over columns for further info
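The two-step groupby used to compute the per-year percentages can also be written with pd.crosstab(normalize='index'); an alternative sketch on a toy stand-in for rating_df (illustrative rows only, not the real counts):

```python
import pandas as pd

# Toy stand-in for rating_df[['year_added', 'rating']]
toy = pd.DataFrame({'year_added': [2020, 2021, 2021, 2021],
                    'rating': ['TV-MA', 'TV-MA', 'TV-14', 'PG']})

# pd.crosstab(..., normalize='index') yields per-year rating percentages
# directly, replacing the separate count and proportion steps
pct = pd.crosstab(toy['year_added'], toy['rating'], normalize='index') * 100
print(pct.round(1))
```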
The ratings of added productions vary greatly in the early years up to 2014, which, as we have seen, were years with considerably fewer productions added than those that followed. From 2015 onward some repeating patterns emerge in the rating proportions; the changes still observable might reflect how Netflix adapts its catalogue to different segments of its users. It is worth plotting the evolution of age groups to test this theory.
#Plotting
fig = px.bar(rating_year_df.groupby(['year_added','age_group']).sum().reset_index(), x='year_added', y='percentage', color='age_group',
title="Age group breakdown of added productions, per year")
fig.show()
#Note: the following graph is interactive and the cursor can hover over columns for further info
As the number of added productions rose from 2015 onwards, Netflix maintained a fairly stable proportion of ratings across the different population segments (adults, teens and children), with a slight rise in productions fit for teenagers at the expense of productions catered to children. In summary, in recent years adult-restricted productions make up slightly less than half of added productions, followed by teen-rated and then children-rated productions.
In conclusion, we found some key insights into the Netflix dataset: most Netflix productions are movies and were produced in the Northern Hemisphere (with the US as the major player); most productions were both released and added in recent years (up until September 2021); most movies are around 100 minutes long and most TV Shows don't make it past a couple of seasons; the most popular genres aside from International Movies and TV Shows are Dramas, Comedies and Documentaries; productions rated TV-MA and TV-14 are the most common, and Netflix has remained fairly stable in the age groups its added productions cater to.
The dataset is fairly limited: despite holding information about the nature of Netflix's productions (namely TV Shows and Movies), it lacks any user information, such as how users rate each production, how many times productions are viewed, or other potential metrics and preferences. Nevertheless, one could argue that after this EDA it would be easier to find potential users that may show interest in productions based on genre, country of production, rating, duration or any of the other insights we found.